Parallelized Method

RDD can be created by calling SparkContext’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5.

RDD From Array
val rdd = sc.parallelize(Array("sun","mon","tue","wed","thu","fri"),4)
rdd.collect


rdd = sc.parallelize(("sun","mon","tue","wed","thu","fri"),4)
rdd.collect()

RDD From List
val rdd =sc.parallelize(List(1,2,3,4,5))
rdd.collect

rdd =sc.parallelize([1,2,3,4,5])
rdd.collect()

val rdd = sc.parallelize(List((2,"teradata"), (1, "vertica"),(2, "oracle"), (1, "hive"), (2, "hbase"), (1, "cassandra"), (2, "mongoBB")))
rdd.collect

rdd = sc.parallelize([(2,"teradata"), (1, "vertica"),(2, "oracle"), (1, "hive"), (2, "hbase"), (1, "cassandra"), (2, "mongoBB")])
print(rdd.collect())
RDD From Nested List
val Student = List(("IT",("Pradeep",50000.0)),("CS",("CP",50200.0)),("ECE",("Deepak",62200.0)),("Mech",("Narpat",65000.0)))
val StudentRDD = sc.makeRDD(Student)
StudentRDD.collect
Student = List(("IT",("Pradeep",50000.0)),("CS",("CP",50200.0)),("ECE",("Deepak",62200.0)),("Mech",("Narpat",65000.0)))
StudentRDD = sc.makeRDD(Student)
StudentRDD.collect

RDD Using Range
val rdd = sc.parallelize(1 to 10, 3)
rdd.collect


RDD From Sequence
val rdd = sc.parallelize(Seq(("Seq1", 1),("Seq2", 2),("Seq3", 3)))
rdd.collect


rdd = sc.parallelize(Seq(("Seq1", 1),("Seq2", 2),("Seq3", 3)))
rdd.collect

Empty RDD
var rdd=sc.parallelize(Seq.empty[String])
rdd.collect


How to access the elements of the RDD.
val datasets = sc.parallelize(List(("A","HYD",10),("B","BLR",30),("A","HYD",40),("B","BLR",50),("C","DEL",60)))
datasets.map(x => (x._1,x._2)).collect

Accessing the first and second elements of the RDD.
 
How to access the elements the particular elements of an RDD
val rdd = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 1)
val kvRdd= rdd.keyBy(_.length)
val filterRdd = kvRdd.filter(p => p._1 == 4)
filterRdd.collect

 
val rdd = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 1)
val kvRdd= rdd.keyBy(_.length)
val filterRdd = kvRdd.filter(p => p._1 == 4)
kvRdd.lookup(4)
 

No comments:

Post a Comment